Abstract

Approximating non-linear kernels using fea- 
ture maps has gained a lot of interest in re- 
cent years due to applications in reducing 
training and testing times of SVM classifiers 
and other kernel based learning algorithms. 
We extend this line of work and present low 
distortion embeddings for dot product ker- 
nels into linear Euclidean spaces. We base 
our results on a classical result in harmonic 
analysis characterizing all dot product ker- 
nels and use it to define randomized feature 
maps into explicit low dimensional Euclidean 
spaces in which the native dot product pro- 
vides an approximation to the dot product 
kernel with high confidence. 



1 Introduction 

Kernel methods have gained much importance in ma- 
chine learning in recent years due to the ease with 
which they allow algorithms designed to work in lin- 
ear feature spaces to be applied to implicit non lin- 
ear feature spaces. Typically these non linear fea- 
ture spaces are high (often infinite) dimensional and 
in order to avoid incurring the cost of explicitly work- 
ing in these spaces, one invokes the well known ker- 
nel trick which exploits the fact that the algorithms 
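in question interact with data solely through pairwise
inner products. For example, instead of directly
learning a hyperplane classifier in $\mathbb{R}^d$, one considers
a non-linear map $\Phi : \mathbb{R}^d \to \mathcal{H}$ such that for all
$\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, $\langle \Phi(\mathbf{x}), \Phi(\mathbf{y}) \rangle_{\mathcal{H}} = K(\mathbf{x}, \mathbf{y})$ for some easily
computable kernel $K$. One then tries to learn a classifier
$H : \mathbf{x} \mapsto \mathbf{w}^\top \Phi(\mathbf{x})$ for some $\mathbf{w} \in \mathcal{H}$.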
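
As a small illustration of the kernel trick described above, the following sketch checks that an explicit feature map reproduces a kernel value as a plain inner product. The degree-2 homogeneous polynomial kernel and the helper names `poly2_kernel` and `poly2_feature_map` are chosen here purely for illustration and are not taken from the paper.

```python
from itertools import product

def poly2_kernel(x, y):
    """Homogeneous degree-2 polynomial kernel K(x, y) = <x, y>^2."""
    return sum(a * b for a, b in zip(x, y)) ** 2

def poly2_feature_map(x):
    """Explicit map Phi(x) = (x_i * x_j) over all ordered index pairs (i, j)."""
    return [x[i] * x[j] for i, j in product(range(len(x)), repeat=2)]

x = [1.0, 2.0, -1.0]
y = [0.5, -1.0, 3.0]

# Inner product in the explicit d^2-dimensional feature space.
lhs = sum(a * b for a, b in zip(poly2_feature_map(x), poly2_feature_map(y)))
# Kernel evaluated directly in the d-dimensional input space.
rhs = poly2_kernel(x, y)
```

Here the explicit map already needs $d^2$ coordinates for degree 2, which is the cost the kernel trick avoids.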

However, one is faced with the problem of representation
in these non linear feature spaces and is at the risk
of incurring the curse of dimensionality. The solution
to this problem comes in the form of Representer Theorems
(see Argyriou et al. 2009 for recent results)
which act as an implicit dimensionality reduction step
by giving us an assurance that the object(s) of interest,
for example the normal vector to the hyperplane
$\mathbf{w}$ in the case of classification and non-linear regression,
the cluster centers in the case of kernel k-means,
or the principal components in the case of kernel PCA,
would necessarily lie in the span of the non-linear feature
maps of the training vectors in the respective examples
(see Schölkopf and Smola 2002). For instance,
in case of the SVM algorithm, the result ensures that
the maximum margin hyperplane in $\mathcal{H}$ would necessarily
be of the form $\mathbf{w} = \sum_i \alpha_i \Phi(\mathbf{x}_i)$ where $\mathbf{x}_i$ are
the training points. In case of SVM regression and
classification, such a result is arrived at by application
of the Karush-Kuhn-Tucker conditions whereas in
the other two applications, the respective formulations
themselves yield such a result.

Whereas this appears to solve the problem of the curse 
of dimensionality, it actually paves the way for an en- 
tirely new kind of curse - one that we call the Curse 
of Support. In order to evaluate the output of the 
algorithms on test data, say in the case of SVM clas- 
sification, one has to compute the kernel measures of 
the test point with all the training points that partici- 
pate in defining the normal vector w. This cost can be 
prohibitive if the support is large. Unfortunately this 
is almost surely the case with large datasets as demonstrated
by several results (Steinwart and Christmann 2008; Bengio et al.
2003; Steinwart 2005) which predict
an unbounded growth in the support sizes with
growing training set sizes. A similar fate awaits all 
other kernel algorithms that use the support vector 
effect in order to avoid explicit representations. 

This presents a dilemma where a large training set is 
beneficial in obtaining superior generalization properties
but is simultaneously responsible for slowing the
algorithms' predictive routines. There has been a lot
of research on SVM formulations with sparsity promoting
regularizers (see for example Bi et al. 2003)
and support vector reduction (see for example Cossalter
et al. 2011). However, although these efforts



have yielded rich empirical returns, they have neither 
addressed other kernel algorithms nor approached the 
question behind the curse in a systematic way. 

2 Related Work 



In a very elegant result, Rahimi and Recht (2007)
demonstrated how this curse can be beaten by way
of low-distortion embeddings. Their result, building
upon a classical result in harmonic analysis called
Bochner's Theorem (refer to Rudin 1962), shows how
to, in some sense, embed the non-linear feature space
(i.e. $\mathcal{H}$, the Reproducing Kernel Hilbert Space associated
with the kernel $K$) into a low dimensional Euclidean
space while incurring an arbitrarily small additive
distortion in the inner product values. More
formally they constructed randomized feature maps
$Z : \mathbb{R}^d \to \mathbb{R}^D$ such that for $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, $\langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle \approx K(\mathbf{x}, \mathbf{y})$
with very high probability.
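
To make the idea of a randomized feature map concrete, the sketch below implements the well known random Fourier feature construction for the Gaussian kernel $K(\mathbf{x}, \mathbf{y}) = \exp(-\|\mathbf{x} - \mathbf{y}\|^2 / 2\sigma^2)$. It is only a minimal illustration of this family of methods; the function names, the choice of kernel, the bandwidth and the value of $D$ are assumptions made here and are not specified in the text above.

```python
import math
import random

def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * sigma ** 2))

def make_rff_map(d, D, sigma=1.0, seed=0):
    """Return Z : R^d -> R^D with <Z(x), Z(y)> approximating the Gaussian kernel."""
    rng = random.Random(seed)
    # Frequencies drawn from the Fourier transform of the kernel, N(0, I / sigma^2).
    omegas = [[rng.gauss(0.0, 1.0 / sigma) for _ in range(d)] for _ in range(D)]
    # Random phase offsets, uniform on [0, 2 pi].
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(D)]
    scale = math.sqrt(2.0 / D)

    def Z(x):
        return [scale * math.cos(sum(w_k * x_k for w_k, x_k in zip(w, x)) + b)
                for w, b in zip(omegas, offsets)]
    return Z

Z = make_rff_map(d=3, D=20000, seed=0)
x = [0.2, -0.1, 0.4]
y = [0.0, 0.3, 0.1]
approx = sum(a * b for a, b in zip(Z(x), Z(y)))  # linear kernel in R^D
exact = gaussian_kernel(x, y)                    # kernel in input space
```

The additive error of `approx` shrinks roughly as $1/\sqrt{D}$, which is the kind of guarantee the embeddings discussed here aim for.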

This allows one to overcome the curse of support in a 
systematic way for all the kernel learning tasks mentioned
before since one may now work in the explicit
low dimensional space $\mathbb{R}^D$ with explicit representations
whose complexity depends only on the dimensionality
of the space. Their contribution is reminiscent
of Indyk and Motwani (1998) who perform
low distortion embeddings (by invoking the Johnson-Lindenstrauss
Lemma) in order to overcome the curse
of dimensionality for the nearest neighbor problem.

Subsequently there has been an increased interest in 
the kernel learning community toward results that al- 
low one to use linear kernels over some transformed 
feature space without having to sacrifice the benefits
provided by non-linear ones. Rahimi and Recht (2007)
considered only translation invariant kernels i.e. kernels
of the form $K(\mathbf{x}, \mathbf{y}) = f(\mathbf{x} - \mathbf{y})$ for some positive
definite function $f : \mathbb{R}^d \to \mathbb{R}$. Subsequently Li et al.
(2010) generalized this to a larger class of group invariant
kernels while still invoking Bochner's theorem.



Maji and Berg (2009) presented a similar result for
the intersection kernel (also known as the min kernel)
$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} \min\{x_i, y_i\}$ which was generalized
by Vedaldi and Zisserman (2010) to the class of additive
homogeneous kernels $K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} k_i(x_i, y_i)$
where $k_i(x, y) = (xy)^{\gamma/2} f_i(\log x - \log y)$ for some $\gamma \in \mathbb{R}$
and positive definite functions $f_i$. Vempati et al. (2010)
extended this idea to provide feature maps for RBF kernels
of the form $K(\mathbf{x}, \mathbf{y}) = \exp\left(-\frac{1}{2\sigma^2}\chi^2(\mathbf{x}, \mathbf{y})\right)$
where $\chi^2$ is the Chi-squared distance measure.

There have been approaches that try to perform em- 
beddings in a task dependent manner (see for example 
Perronnin et al. 2010). The idea of directly considering
low-rank approximations to the Gram matrix has
also been explored (see for example Bach and Jordan
2005). However, the approaches considered in Rahimi
and Recht (2007) and Vedaldi and Zisserman (2010)
are the ones that most directly relate to this work.

2.1 Our Contribution

In this work we present feature maps approximating
positive definite dot product kernels i.e. kernels of the
form $K(\mathbf{x}, \mathbf{y}) = f(\langle \mathbf{x}, \mathbf{y} \rangle)$ for some real valued function
$f : \mathbb{R} \to \mathbb{R}$. More formally we present feature
maps $Z : \mathbb{R}^d \to \mathbb{R}^D$ (where we refer to $\mathbb{R}^d$ as the input
space and $\mathbb{R}^D$ as the embedding space) such
that for all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, $\langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle \approx K(\mathbf{x}, \mathbf{y})$ with
very high probability. We base our result on a characterization
of real valued functions $f$ that yield such
positive definite kernels. We also demonstrate how
our methods can be extended to compositional kernels
of the form $K_{co}(\mathbf{x}, \mathbf{y}) = K_{dp}(K(\mathbf{x}, \mathbf{y}))$ where $K_{dp}$ is
some dot product kernel and $K$ is an arbitrary positive
definite kernel.

The kernels covered by our approach include homo- 
geneous polynomial kernels which are not covered by 
Vedaldi and Zisserman's treatment of homogeneous 
kernels as these are inseparable kernels which their ap- 
proach cannot handle. 

In the following, vectors shall be denoted in boldface.
$x_i$ denotes the $i$-th Cartesian coordinate of a vector $\mathbf{x}$.
$B_p(\mathbf{0}, r)$ denotes the set $\{\mathbf{x} \in \mathcal{H} : \|\mathbf{x}\|_p \leq r\}$ for some
inner product space $\mathcal{H}$ (or some finite dimensional Euclidean
space $\mathbb{R}^d$). In particular, $B_1(\mathbf{0}, 1)$ and $B_2(\mathbf{0}, 1)$
denote the sets of points with at most unit 1-norm and 2-norm
respectively. $\|\cdot\|$ without any subscripts denotes
the 2-norm.

3 A Characterization of Positive 
Definite Dot Product Kernels 

The result underlying our feature map constructions 
is a characterization of real valued functions on the 
real line that can be used to construct positive definite 
dot product kernels. This is a classical result in har- 
monic analysis due to Schoenberg (1942), that charac- 
terizes positive definite functions on the unit sphere in 
a Hilbert space. Our first observation, formalized below,
is simply the fact that the restriction to the unit
sphere is not crucial.

Theorem 1. A function $f : \mathbb{R} \to \mathbb{R}$ defines a positive
definite kernel $K : B_2(\mathbf{0}, 1) \times B_2(\mathbf{0}, 1) \to \mathbb{R}$
as $K : (\mathbf{x}, \mathbf{y}) \mapsto f(\langle \mathbf{x}, \mathbf{y} \rangle)$ iff $f$ is an analytic function
admitting a Maclaurin expansion with only non-negative
coefficients i.e. $f(x) = \sum_{n=0}^{\infty} a_n x^n$, $a_n \geq 0$,
$n = 0, 1, 2, \ldots$. Here $B_2(\mathbf{0}, 1) \subset \mathcal{H}$ for some Hilbert
space $\mathcal{H}$.

Proof. We first recollect Schoenberg's result in its original
form.

Theorem 2 (Schoenberg (1942), Theorem 2). A function
$f : [-1, 1] \to \mathbb{R}$ constitutes a positive definite
kernel $K : S_\infty \times S_\infty \to \mathbb{R}$, $K : (\mathbf{x}, \mathbf{y}) \mapsto f(\langle \mathbf{x}, \mathbf{y} \rangle)$
iff $f$ is an analytic function admitting a Maclaurin
expansion with only non-negative coefficients i.e.
$f(x) = \sum_{n=0}^{\infty} a_n x^n$, $a_n \geq 0$, $n = 0, 1, 2, \ldots$. Here
$S_\infty = \{\mathbf{x} \in \mathcal{H} : \|\mathbf{x}\|_2 = 1\}$ for some Hilbert space $\mathcal{H}$.

To see that the non-negativeness of the coefficients
of the Maclaurin expansion is necessary
just apply Theorem 2 to points on $S_\infty$. Since
$\{\langle \mathbf{x}, \mathbf{y} \rangle : \mathbf{x}, \mathbf{y} \in B_2(\mathbf{0}, 1)\} = \{\langle \mathbf{x}, \mathbf{y} \rangle : \mathbf{x}, \mathbf{y} \in S_\infty\}$, the
result extends to the general case when the points are
coming from $B_2(\mathbf{0}, 1)$. To see that this suffices we
make use of some well known facts regarding positive
definite kernels (for example refer to Schölkopf and
Smola 2002).



Fact 3. If $K_n$, $n \in \mathbb{N}$ are positive definite kernels
defined on some common domain then the following
statements are true:

1. $c_m K_m + c_n K_n$ is also a positive definite kernel
provided $c_m, c_n \geq 0$.

2. $K_m K_n$ is also a positive definite kernel.

3. If $\lim_{n \to \infty} K_n = K$ and $K$ is continuous then $K$ is
also a positive definite kernel.

Starting with the fact that the dot product kernel
is positive definite on any Hilbert space $\mathcal{H}$, applying
Fact 3.1 and Fact 3.2 we get that for every $n \in \mathbb{N}$,
the kernel $K_n(\mathbf{x}, \mathbf{y}) = \sum_{i=0}^{n} a_i \langle \mathbf{x}, \mathbf{y} \rangle^i$ is positive definite.
An application of Fact 3.3 along with the fact that the
Maclaurin series converges uniformly within its radius
of convergence then proves the result. □
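
The role of the sign condition in Theorem 1 can be seen on a two point example. The sketch below, whose function names and test points are chosen here for illustration, compares $f(t) = e^t$ (all Maclaurin coefficients positive) with $f(t) = 1 - t$ (coefficient $a_1 = -1$) by evaluating the quadratic form $\mathbf{c}^\top G \mathbf{c}$ of the resulting Gram matrix $G$ on the unit vectors $\mathbf{x}$ and $-\mathbf{x}$.

```python
import math

def gram(f, points):
    """Gram matrix G[i][j] = f(<x_i, x_j>) of a dot product kernel."""
    return [[f(sum(a * b for a, b in zip(x, y))) for y in points] for x in points]

def quad_form(G, c):
    """Quadratic form c^T G c; it must be >= 0 for every c if G is positive semidefinite."""
    n = len(c)
    return sum(c[i] * G[i][j] * c[j] for i in range(n) for j in range(n))

x = [1.0, 0.0]
points = [x, [-v for v in x]]   # two antipodal points on the unit sphere
c = [1.0, -1.0]

q_bad = quad_form(gram(lambda t: 1.0 - t, points), c)   # negative coefficient a_1 = -1
q_good = quad_form(gram(math.exp, points), c)           # all coefficients positive
```

For $f(t) = 1 - t$ the Gram matrix has zeros on the diagonal and 2 off the diagonal, so the quadratic form is negative and the kernel is not positive definite, whereas the exponential yields a positive value on the same points.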



Actually Schoenberg (1942) shows that a function $f$
need only have a non-negative expansion in terms
of Gegenbauer polynomials in order to yield a positive
definite kernel over finite dimensional Euclidean
spaces (a condition weaker than that of Theorem 1).
However, functions $f$ that do not have non-negative
Maclaurin expansions are not very useful because they
yield kernels that become indefinite after the dimensionality
crosses a certain threshold. This is because a
dot product kernel that is positive definite over all finite
dimensional Euclidean spaces is also positive definite
over Hilbert spaces (see Section 3.1 for the
simple proof).

Most dot product kernels used in practice (see
Schölkopf and Smola 2002) satisfy the stronger condition
of the Maclaurin expansion having non-negative
coefficients and our results readily apply to these.

We note that, as a corollary of Schoenberg's result, all 
dot product kernels are necessarily unbounded over 
non-compact domains. This is in stark contrast with
translation invariant kernels that are always bounded
(see Rudin 1962 for a proof). Hence from now on we
shall assume that our data is confined to some compact
domain $\Omega \subset \mathbb{R}^d$. In order to study the behavior of our
feature maps as this domain grows in size, we shall
assume that $\Omega \subset B_1(\mathbf{0}, R)$ for some $R > 0$.

We shall assume that the function $f$ is defined and
differentiable on a closed interval $[-I, I]$. The value of
$I$ shall be dictated by the value of $R$ chosen above. If $f$
is defined only on an open interval $(-\gamma, \gamma)$ around zero
(as is the case when the Maclaurin series has a finite
radius of convergence) then we can choose a scalar
$c > \frac{I}{\gamma}$, define $g(x) = f\left(\frac{x}{c}\right)$ and use $g$ to define a new
kernel $K_g$. This has the implicit effect of scaling the
data vectors in input space $\mathbb{R}^d$ down by a factor of $c$.



3.1 Positive definite dot product kernels over 
finite dimensional spaces 

As noted in the main paper, the original result of 
Schoenberg characterizing functions that yield a posi- 
tive definite dot product kernel over finite dimensional 
Euclidean spaces in terms of those admitting positive 
Gegenbauer expansions is not very useful in practice. 
This is because of two reasons. Firstly as we shall 
show below, functions that have non-negative Gegenbauer
expansions include those that yield positive definite
kernels only up to a certain dimensionality i.e.
these kernels are positive definite up to $\mathbb{R}^{d_0}$ for some
fixed $d_0$ and indefinite on all Euclidean spaces of dimensionality
$d > d_0$. Secondly, from an algorithmic
perspective, the Gegenbauer expansions do not seem 
amenable to the type of feature construction methods 
described in this paper - this is because Gegenbauer 
polynomials themselves admit negative coefficients. 

The result characterizing positive definite functions 
over Hilbert spaces in terms of positive Maclaurin ex- 
pansions on the other hand is appealing for the very 
same reasons - functions satisfying this stronger condition
are positive definite over all finite dimensional
spaces and the method readily lends itself to feature 
construction methods. 

Lemma 4. A function $f : \mathbb{R} \to \mathbb{R}$ yields positive definite
dot product kernels over all finite dimensional Euclidean
spaces iff it yields positive definite dot product
kernels over Hilbert spaces.



Proof. We shall first prove this result for the special
case of $\ell_2$, the Hilbert space of all square summable
sequences. Schoenberg's result (Corollary 1) will then
allow us to extend it to all Hilbert spaces. The if part
follows readily from the observation that $\ell_2$ contains
all finite dimensional Euclidean spaces as subspaces
and the fact that any kernel that is positive definite
over a set is positive definite over all its subsets as well.

For the only if part consider any set of $n$ points
$S = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\} \subset \ell_2$. Clearly there exists an
embedding $\Phi : S \to \mathbb{R}^n$ such that for all $i, j \in [n]$,
$\langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle = \langle \mathbf{x}_i, \mathbf{x}_j \rangle$ (note that the left and
the right hand sides are inner products over different
spaces). Such an embedding can be constructed, for
example, by taking the Cholesky decomposition of the
Gram matrix given by the inner product on $\ell_2$ (the entries
of the Gram matrix are finite by an application
of the Cauchy-Schwarz inequality).

Consider the matrix $A = [a_{ij}]$ where
$a_{ij} = f(\langle \Phi(\mathbf{x}_i), \Phi(\mathbf{x}_j) \rangle)$. Since $f$ yields positive definite kernels
over all finite dimensional Euclidean spaces, we
have $A \succeq 0$. However, by the isometry of the embedding,
we have $a_{ij} = f(\langle \mathbf{x}_i, \mathbf{x}_j \rangle)$. Hence, for any
$n < \infty$, for any arbitrary $n$ points, the Gram matrix
given by $f(\langle \cdot, \cdot \rangle)$ is positive definite (here $\langle \cdot, \cdot \rangle$ is the
dot product over $\ell_2$). Thus $f$ yields a positive definite
kernel over $\ell_2$ as well.

To finish off the proof we now use Schoenberg's theorem
to extend this result to all Hilbert spaces. If a
dot product kernel is positive definite over all finite
dimensional spaces then the above argument shows
it to be positive definite over $\ell_2$. Hence, by Corollary 1
the function $f$ defining this kernel must have
a non-negative Maclaurin expansion. From here on
an argument similar to the one used to prove the sufficiency
part of Corollary 1 (using Fact 3) can be used
to show that this kernel is positive definite over all
Hilbert spaces.

On the other hand, if a dot product kernel is positive
definite over Hilbert spaces, then we use its positive-definiteness
over $\ell_2$, along with the argument used in
showing the if part above, to prove that the kernel is
positive definite over all finite dimensional Euclidean
spaces. □
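
The Cholesky based embedding used in the proof can be sketched as follows. The three points below stand in for elements of $\ell_2$ with finitely many non-zero coordinates; they, and the helper names `cholesky` and `dot`, are hypothetical choices made here. The plain Cholesky routine assumes a strictly positive definite Gram matrix (linearly independent points); a semidefinite Gram matrix would need a pivoted variant or an eigendecomposition instead.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cholesky(G):
    """Return lower-triangular L with G = L L^T (G must be positive definite)."""
    n = len(G)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(G[i][i] - s)
            else:
                L[i][j] = (G[i][j] - s) / L[j][j]
    return L

# Three points living in a 5-dimensional slice of l_2.
S = [[1.0, 2.0, 0.0, 1.0, 0.5],
     [0.0, 1.0, 1.0, 0.0, 2.0],
     [3.0, 0.0, 1.0, 1.0, 0.0]]

G = [[dot(u, v) for v in S] for u in S]   # Gram matrix of the original points
Phi = cholesky(G)                         # row i is the image Phi(x_i) in R^3

# Inner products between embedded points, to be compared against G.
G_embedded = [[dot(u, v) for v in Phi] for u in Phi]
```

The rows of `Phi` live in $\mathbb{R}^n$ with $n = 3$, yet reproduce every pairwise inner product of the original points.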



An easy application of Corollary 1 then gives us the
following result:

Corollary 5. A function $f : \mathbb{R} \to \mathbb{R}$ yields positive
definite kernels over all finite dimensional Euclidean
spaces iff it is an analytic function admitting a Maclaurin
expansion with only non-negative coefficients.

However, we note that even functions that have 
only positive Gegenbauer expansions (and not positive
Maclaurin expansions) may admit low dimensional
feature maps. This is indicated by the Johnson-Lindenstrauss
Lemma (for example see Indyk and
Motwani 1998) that predicts the existence of low-distortion
embeddings from arbitrary Hilbert spaces
(thus, in particular from the reproducing kernel
Hilbert spaces of these kernels) to finite dimensional
Euclidean spaces. Interestingly, it is very tempting to
view the constructions of Rahimi and Recht (2007) and
Vedaldi and Zisserman (2010) (among others) as algorithmic
versions of the Johnson-Lindenstrauss Lemma.
The challenge in all such cases, however, is to make 
these constructions explicit, uniform, as well as algo- 
rithmically efficient. 

3.2 Examples of Positive Definite Dot 
Product Kernels 

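The most well known dot product kernels are the polynomial
kernels which are used in either a homogeneous
form ($K(\mathbf{x}, \mathbf{y}) = \langle \mathbf{x}, \mathbf{y} \rangle^p$ for some $p \in \mathbb{N}$) or
a non-homogeneous form ($K(\mathbf{x}, \mathbf{y}) = (\langle \mathbf{x}, \mathbf{y} \rangle + r)^p$ for
some $p \in \mathbb{N}$, $r \in \mathbb{R}^+$). Lesser known examples include
Vovk's real polynomial kernel ($K(\mathbf{x}, \mathbf{y}) = \frac{1 - \langle \mathbf{x}, \mathbf{y} \rangle^p}{1 - \langle \mathbf{x}, \mathbf{y} \rangle}$
for some $p \in \mathbb{N}$), Vovk's infinite polynomial kernel
($K(\mathbf{x}, \mathbf{y}) = \frac{1}{1 - \langle \mathbf{x}, \mathbf{y} \rangle}$) and the exponential dot product
kernel ($K(\mathbf{x}, \mathbf{y}) = \exp\left(\frac{\langle \mathbf{x}, \mathbf{y} \rangle}{\sigma^2}\right)$ for some $\sigma \in \mathbb{R}$).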
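
For reference, the Maclaurin coefficients $a_n$ of these kernels can be written down in closed form. The sketch below assumes the standard definitions of the kernels (in particular of Vovk's kernels and of the exponential kernel with bandwidth $\sigma$); the function names are chosen here for illustration. Each coefficient function is checked against the closed form of $f$ by summing a truncated series.

```python
import math

def coeff_homogeneous(p):
    # f(t) = t^p
    return lambda n: 1.0 if n == p else 0.0

def coeff_inhomogeneous(p, r):
    # f(t) = (t + r)^p = sum_n C(p, n) r^(p - n) t^n
    return lambda n: math.comb(p, n) * r ** (p - n) if n <= p else 0.0

def coeff_vovk_real(p):
    # f(t) = (1 - t^p) / (1 - t) = 1 + t + ... + t^(p - 1)
    return lambda n: 1.0 if n < p else 0.0

def coeff_vovk_infinite():
    # f(t) = 1 / (1 - t) = sum_n t^n, valid for |t| < 1
    return lambda n: 1.0

def coeff_exponential(sigma):
    # f(t) = exp(t / sigma^2) = sum_n t^n / (sigma^(2n) n!)
    return lambda n: 1.0 / (sigma ** (2 * n) * math.factorial(n))

def series(a, t, terms=60):
    """Truncated Maclaurin series sum_{n < terms} a(n) t^n."""
    return sum(a(n) * t ** n for n in range(terms))

t = 0.3
values = {
    "inhomogeneous": (series(coeff_inhomogeneous(3, 2.0), t), (t + 2.0) ** 3),
    "vovk_real": (series(coeff_vovk_real(4), t), (1 - t ** 4) / (1 - t)),
    "vovk_infinite": (series(coeff_vovk_infinite(), t), 1 / (1 - t)),
    "exponential": (series(coeff_exponential(1.5), t), math.exp(t / 1.5 ** 2)),
}
```

All of these coefficient sequences are non-negative, which is what Theorem 1 requires.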



It is interesting to note that due to a result by Steinwart
(2001), the last two kernels (Vovk's infinite kernel
and exponential dot product kernel) are universal
on any compact subset $S \subset \mathbb{R}^d$ which means that the
space of all functions induced by them is dense in $C(S)$,
the space of all continuous functions defined on $S$. The
widely used Gaussian kernel is actually a normalized
version of the exponential dot product kernel. However
Vovk's kernels are seldom used in practice since
they are expected to have poor generalization properties
due to their flat spectrum as noted by Schölkopf
and Smola (2002).



4 Random Feature Maps 

Schoenberg's result naturally paves the way for a result
of the kind presented in Rahimi and Recht (2007)
in which we view the coefficients of the Maclaurin expansion
as a positive measure defined on $\mathbb{N} \cup \{0\}$ and
define estimators for each individual term of the expression.
However, as we shall see, estimating higher
order terms in our case will require more randomness.
Thus, a set of coefficients $\{a_n\}$ defining a heavy
tailed distribution would entail huge randomness costs
in case the expansion has a large (or infinite) number
of terms. For example the sequence $a_n = \frac{1}{n}$ has a
linear rather than an exponential tail.

Purushottam Kar, Harish Karnick

To address this issue we do not utilize the coefficients
as measure values, rather we impose an external distribution
on $\mathbb{N} \cup \{0\}$ having an exponential tail. The distribution
that we choose to impose is $P[N = n] = \frac{1}{p^{n+1}}$
for some fixed $p > 1$. In practice $p = 2$ is a good choice
since it establishes a normalized measure over $\mathbb{N} \cup \{0\}$.
We will, using this distribution, obtain unbiased estimates
for the kernel value and prove corresponding
uniform convergence results.
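
For $p = 2$ the external distribution $P[N = n] = 2^{-(n+1)}$ is a geometric distribution and can be sampled with fair coin tosses alone. The sketch below is one simple way to do this; the helper name `sample_order`, the seed and the sample size are choices made here for illustration. For general $p > 1$ the masses $p^{-(n+1)}$ sum to $\frac{1}{p-1}$, so $p = 2$ is the value for which they sum to one.

```python
import random

def sample_order(rng):
    """Draw N with P[N = n] = 1 / 2**(n + 1): count tails before the first head."""
    n = 0
    while rng.random() < 0.5:   # "tails" with probability 1/2
        n += 1
    return n

rng = random.Random(0)
draws = [sample_order(rng) for _ in range(100000)]

# Empirical frequencies of the two most likely orders.
freq0 = draws.count(0) / len(draws)
freq1 = draws.count(1) / len(draws)

# Total mass of the distribution, truncated far into the tail.
total_mass = sum(1.0 / 2 ** (n + 1) for n in range(60))
```

Since an order $n$ costs $n$ random sign vectors, the exponential tail keeps the expected amount of randomness per feature small.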

We stress that the positiveness of the coefficients $\{a_n\}$
is still essential for us to be able to provide an embedding
into real spaces. If the coefficients are allowed
to be negative, the resulting kernels would no longer
remain positive definite and we would only be able to
provide feature maps that map to pseudo-Euclidean
spaces. It turns out that the imposition of an external
measure is crucial from a statistical point of view
as well. As we shall see later, it allows us to obtain
bounded estimators which in turn allow us to use Hoeffding
bounds to prove uniform convergence results.

We now move on to describe our feature map : our 
feature map will essentially be a concatenation of sev- 
eral copies of identical real valued feature maps. These 
copies will reduce variance and allow us to prove con- 
vergence bounds. The following simple fact about ran- 
dom projections is at the core of our feature maps. 

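Lemma 6. Let $\boldsymbol{\omega} \in \mathbb{R}^d$ be a vector each of whose
coordinates has been chosen pairwise independently
using fair coin tosses from the set $\{-1, 1\}$ and consider
the feature map $Z : \mathbb{R}^d \to \mathbb{R}$, $Z : \mathbf{x} \mapsto \boldsymbol{\omega}^\top \mathbf{x}$. Then for
all $\mathbf{x}, \mathbf{y} \in \mathbb{R}^d$, $E[Z(\mathbf{x})Z(\mathbf{y})] = \langle \mathbf{x}, \mathbf{y} \rangle$.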



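Proof. We have

$$E[Z(\mathbf{x})Z(\mathbf{y})] = E\left[\boldsymbol{\omega}^\top \mathbf{x} \cdot \boldsymbol{\omega}^\top \mathbf{y}\right]$$
$$= E\left[\sum_{i=1}^{d} \omega_i^2 x_i y_i + \sum_{i \neq j} \omega_i \omega_j x_i y_j\right]$$
$$= \sum_{i=1}^{d} E[\omega_i^2] x_i y_i + \sum_{i \neq j} E[\omega_i] E[\omega_j] x_i y_j$$
$$= \sum_{i=1}^{d} x_i y_i + 0 = \langle \mathbf{x}, \mathbf{y} \rangle$$

where in the third equality we have used linearity of
expectation and the pairwise independence of the different
coordinates of $\boldsymbol{\omega}$. The fourth equality is arrived
at by using properties of the distribution. Notice that
any distribution that is symmetric about zero with
unit second moment can be used for sampling the coordinates
of $\boldsymbol{\omega}$. This particular choice both simplifies
the analysis and is easy to implement in practice. □

We now present a real valued feature map for the
dot product kernel. First of all we randomly pick a
number $N \in \mathbb{N} \cup \{0\}$ with $P[N = n] = \frac{1}{p^{n+1}}$. Next
we pick $N$ independent Rademacher vectors $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_N$
and output the feature map $Z : \mathbb{R}^d \to \mathbb{R}$,
$Z : \mathbf{x} \mapsto \sqrt{a_N p^{N+1}} \prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{x}$. We first of all establish that the
linear kernel obtained by using this feature map gives
us an unbiased estimate of the kernel value at each
pair of points chosen from the domain $\Omega$.

Lemma 7. Let $Z : \mathbb{R}^d \to \mathbb{R}$ be the feature map
constructed above. Then for all $\mathbf{x}, \mathbf{y} \in \Omega$ we have
$E[Z(\mathbf{x})Z(\mathbf{y})] = K(\mathbf{x}, \mathbf{y})$ where the expectation is over
the choice of the Rademacher vectors.

Proof. We have

$$E[Z(\mathbf{x})Z(\mathbf{y})] = E_N\left[E_{\boldsymbol{\omega}}\left[Z(\mathbf{x})Z(\mathbf{y}) \mid N\right]\right]$$
$$= E_N\left[a_N p^{N+1} E_{\boldsymbol{\omega}}\left[\prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{x} \prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{y}\right]\right]$$
$$= E_N\left[a_N p^{N+1} E_{\boldsymbol{\omega}}\left[\prod_{j=1}^{N} \left(\boldsymbol{\omega}_j^\top \mathbf{x} \cdot \boldsymbol{\omega}_j^\top \mathbf{y}\right)\right]\right]$$
$$= E_N\left[a_N p^{N+1} \left(E_{\boldsymbol{\omega}}\left[\boldsymbol{\omega}^\top \mathbf{x} \cdot \boldsymbol{\omega}^\top \mathbf{y}\right]\right)^N\right]$$
$$= E_N\left[a_N p^{N+1} \langle \mathbf{x}, \mathbf{y} \rangle^N\right]$$
$$= \sum_{n=0}^{\infty} \frac{1}{p^{n+1}} \cdot a_n p^{n+1} \langle \mathbf{x}, \mathbf{y} \rangle^n = f(\langle \mathbf{x}, \mathbf{y} \rangle) = K(\mathbf{x}, \mathbf{y})$$

where the first step uses the fact that the index $N$ and
the vectors $\boldsymbol{\omega}_j$ are chosen independently, the fourth
step uses the fact that the vectors $\boldsymbol{\omega}_j$ are chosen independently
among themselves and the fifth step uses
Lemma 6. □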
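
The unbiasedness claim can be checked exactly on a small instance by enumerating every outcome instead of sampling. The sketch below does this for a kernel with finitely many non-zero coefficients; the choice $f(t) = (1 + t)^2$, the test vectors and the function names are assumptions made here for illustration.

```python
from itertools import product

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def exact_expectation(a, x, y, p=2.0):
    """E[Z(x)Z(y)] for a finite coefficient list a, by full enumeration.

    Z(x)Z(y) = a_N p^(N+1) prod_j (w_j . x)(w_j . y) with P[N = n] = 1 / p^(n+1).
    """
    d = len(x)
    signs = list(product([-1.0, 1.0], repeat=d))   # all Rademacher vectors
    total = 0.0
    for n, a_n in enumerate(a):
        prob_n = 1.0 / p ** (n + 1)
        # Average of the product over all n-tuples of sign vectors.
        acc = 0.0
        for ws in product(signs, repeat=n):
            term = 1.0
            for w in ws:
                term *= dot(w, x) * dot(w, y)
            acc += term
        mean = acc / len(signs) ** n
        total += prob_n * (a_n * p ** (n + 1)) * mean
    return total

a = [1.0, 2.0, 1.0]          # f(t) = (1 + t)^2 = 1 + 2t + t^2
x = [0.3, -0.2, 0.1]
y = [0.1, 0.4, -0.2]

expected = exact_expectation(a, x, y)
kernel_value = (1.0 + dot(x, y)) ** 2
```

The factor $p^{n+1}$ inside the feature map cancels the probability $p^{-(n+1)}$ of drawing order $n$, which is why the external measure does not bias the estimate.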

Having obtained a feature map giving us an unbiased
estimate of the kernel value, we move on to establish
bounds on the deviation of the linear kernel given by
this map from its expected value. To do this we obtain
$D$ such feature maps independently and concatenate
them to obtain a multi dimensional feature map
$Z : \mathbb{R}^d \to \mathbb{R}^D$, $Z : \mathbf{x} \mapsto \frac{1}{\sqrt{D}}(Z_1(\mathbf{x}), \ldots, Z_D(\mathbf{x}))$. It is
easy to see that $E[\langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle] = K(\mathbf{x}, \mathbf{y})$. Moreover,
such a concatenation is expected to guarantee an exponentially
fast convergence to $K(\mathbf{x}, \mathbf{y})$ using Hoeffding
bounds. However this requires us to prove that the estimator
corresponding to our feature map i.e. $Z(\mathbf{x})Z(\mathbf{y})$
is bounded. This we establish below:

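Lemma 8. For all $\mathbf{x}, \mathbf{y} \in \Omega$, $|Z(\mathbf{x})Z(\mathbf{y})| \leq p f(pR^2)$.

Proof. Since $Z(\mathbf{x})Z(\mathbf{y}) = a_N p^{N+1} \prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{x} \prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{y}$,
by Hölder's inequality we have, for all $j$,
$|\boldsymbol{\omega}_j^\top \mathbf{x}| \leq \|\boldsymbol{\omega}_j\|_\infty \|\mathbf{x}\|_1 \leq R$ since every coordinate of $\boldsymbol{\omega}_j$ is either
1 or $-1$ and $\mathbf{x} \in \Omega \subset B_1(\mathbf{0}, R)$. A similar result
holds for $|\boldsymbol{\omega}_j^\top \mathbf{y}|$ as well. Thus we have

$$|Z(\mathbf{x})Z(\mathbf{y})| \leq a_N p^{N+1} R^{2N} \leq p \sum_{n=0}^{\infty} a_n p^n R^{2n} = p f(pR^2).$$ □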

We note here that the imposition of an external measure
on $\mathbb{N} \cup \{0\}$ plays a crucial role in the analysis. In
absence of the external measure, one is only able to
bound the estimator by $a_N R^{2N}$ and since $N$ is a potentially
unbounded random variable, this makes application
of Hoeffding bounds impossible. Although there
do exist Hoeffding style bounds for unbounded random
variables, none seem to work in our case. However,
with the simple imposition of an external measure we
obtain an estimator that is bounded by a value dependent
on the range of values taken by the kernel over
the domain, a very desirable quality.

For sake of convenience let us denote $p f(pR^2)$ by
$C_\Omega$ since it is a constant dependent only on the size
of the domain $\Omega$ and independent of the dimension
of the input space $\mathbb{R}^d$. Note that this constant is
proportional to the largest value taken by the kernel
in the domain $\Omega$. This immediately tells us that
for any $\mathbf{x}, \mathbf{y} \in \Omega$,
$P\left[|\langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle - K(\mathbf{x}, \mathbf{y})| > \epsilon\right] \leq 2 \exp\left(-\frac{D\epsilon^2}{8 C_\Omega^2}\right)$.
However we can give much stronger
guarantees than this - we can prove that this loss of
confidence need not be incurred over every single pair
of points but rather the entire domain at once. More
formally, we can show that with very high probability,
$\sup_{\mathbf{x}, \mathbf{y} \in \Omega} |\langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle - K(\mathbf{x}, \mathbf{y})| \leq \epsilon$.

4.1 Uniform Approximation 

As stated before, we are able to ensure that the feature
map designed above gives an accurate estimate
of the kernel value uniformly over the entire domain.
For this we exploit the Lipschitz properties of the kernel
function and our estimator. A similar approach
was adopted by Rahimi and Recht (2007) to provide
corresponding uniform convergence properties for their
estimator. However it is not possible to import their
argument since they were able to exploit the fact that
both their kernel as well as their estimator were translation
invariant. We, having no such guarantees for
our estimator, have to argue differently.

Algorithm 1 Random Maclaurin Feature Maps

Require: A positive definite dot product kernel
    $K(\mathbf{x}, \mathbf{y}) = f(\langle \mathbf{x}, \mathbf{y} \rangle)$.
Ensure: A randomized feature map $Z : \mathbb{R}^d \to \mathbb{R}^D$
    such that $\langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle \approx K(\mathbf{x}, \mathbf{y})$.

Obtain the Maclaurin expansion of $f(x) = \sum_{n=0}^{\infty} a_n x^n$
    by setting $a_n = \frac{f^{(n)}(0)}{n!}$.
Fix a value $p > 1$.
for $i = 1$ to $D$ do
    Choose a non negative integer $N \in \mathbb{N} \cup \{0\}$ with
        $P[N = n] = \frac{1}{p^{n+1}}$.
    Choose $N$ vectors $\boldsymbol{\omega}_1, \ldots, \boldsymbol{\omega}_N \in \{-1, 1\}^d$ selecting
        each coordinate using fair coin tosses.
    Let feature map $Z_i : \mathbf{x} \mapsto \sqrt{a_N p^{N+1}} \prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{x}$.
end for
Output $Z : \mathbf{x} \mapsto \frac{1}{\sqrt{D}}(Z_1(\mathbf{x}), \ldots, Z_D(\mathbf{x}))$.
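
A minimal Python sketch of the Random Maclaurin construction is given below, under the assumption $p = 2$ so that the order $N$ can be drawn with fair coin tosses. The function name `random_maclaurin_map`, its signature, the seed and the test kernel $f(t) = e^t$ (with $a_n = 1/n!$) are illustrative choices made here rather than part of the algorithm as stated.

```python
import math
import random

def random_maclaurin_map(coeff, d, D, seed=0):
    """Sketch of a Random Maclaurin feature map with p = 2.

    coeff(n) must return the non-negative Maclaurin coefficient a_n of f.
    Returns Z : R^d -> R^D with E[<Z(x), Z(y)>] = f(<x, y>).
    """
    rng = random.Random(seed)
    p = 2.0
    features = []
    for _ in range(D):
        # Draw N with P[N = n] = 1 / 2^(n + 1).
        n = 0
        while rng.random() < 0.5:
            n += 1
        scale = math.sqrt(coeff(n) * p ** (n + 1))
        # N Rademacher vectors, one fair coin toss per coordinate.
        omegas = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(n)]
        features.append((scale, omegas))

    def Z(x):
        out = []
        for scale, omegas in features:
            value = scale
            for w in omegas:
                value *= sum(w_k * x_k for w_k, x_k in zip(w, x))
            out.append(value / math.sqrt(D))
        return out
    return Z

coeff_exp = lambda n: 1.0 / math.factorial(n)   # f(t) = exp(t)
Z = random_maclaurin_map(coeff_exp, d=3, D=20000, seed=0)

x = [0.3, -0.2, 0.1]     # both points lie in B_1(0, 1)
y = [0.1, 0.4, -0.2]
estimate = sum(a * b for a, b in zip(Z(x), Z(y)))
exact = math.exp(sum(a * b for a, b in zip(x, y)))
```

Storing the sign vectors once and reusing them for every input is what makes `Z` a fixed map after construction; only the construction itself is random.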

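Let $\mathcal{E}(\mathbf{x}, \mathbf{y}) = \langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle - K(\mathbf{x}, \mathbf{y})$. We will first
show that the function $\mathcal{E}(\cdot, \cdot)$ is Lipschitz over the
domain $\Omega$. Since $\mathcal{E}(\cdot, \cdot)$ itself is differentiable (actually
analytic), its Lipschitz constant can be bounded
by bounding the norms of its gradients i.e. it would
suffice to show that $\sup_{\mathbf{x}, \mathbf{y} \in \Omega} \|\nabla_{\mathbf{x}} \mathcal{E}(\mathbf{x}, \mathbf{y})\| \leq L$ and
$\sup_{\mathbf{x}, \mathbf{y} \in \Omega} \|\nabla_{\mathbf{y}} \mathcal{E}(\mathbf{x}, \mathbf{y})\| \leq L$ for some constant $L$. This
would ensure that if the error incurred by the feature
map is small on a pair of vectors then it would also be
small on all pairs of vectors that are "close" to these
vectors. This is formalized in the following lemma: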

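Lemma 9. If a bivariate function $f$ defined over $\Omega \subset \mathbb{R}^d$
is $L$-Lipschitz in both its arguments then for every
$\mathbf{x}, \mathbf{y} \in \Omega$,
$$\sup_{\mathbf{x}' \in B_2(\mathbf{x}, r) \cap \Omega,\; \mathbf{y}' \in B_2(\mathbf{y}, r) \cap \Omega} |f(\mathbf{x}, \mathbf{y}) - f(\mathbf{x}', \mathbf{y}')| \leq 2Lr.$$

Proof. We have
$|f(\mathbf{x}, \mathbf{y}) - f(\mathbf{x}', \mathbf{y}')| \leq |f(\mathbf{x}, \mathbf{y}) - f(\mathbf{x}, \mathbf{y}')| + |f(\mathbf{x}, \mathbf{y}') - f(\mathbf{x}', \mathbf{y}')|$
$\leq L \cdot \|\mathbf{y} - \mathbf{y}'\| + L \cdot \|\mathbf{x} - \mathbf{x}'\| \leq 2Lr$ where in the
second step we have used the fact that $\mathbf{x}, \mathbf{y}' \in \Omega$. □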

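What this allows us to do is choose a set of points
$T$ that set up an $\epsilon$-net over the domain $\Omega$ at some
scale $\epsilon_1$. If we can ensure that the feature maps provide
an $(\epsilon/2)$-close approximation to $K$ at the centers
of this net i.e. $\sup_{\mathbf{x}, \mathbf{y} \in T} |\mathcal{E}(\mathbf{x}, \mathbf{y})| \leq \epsilon/2$, then the
above result would show us that if the error function
$\mathcal{E}(\cdot, \cdot)$ is $L$-Lipschitz in both its arguments, then
$\sup_{\mathbf{x}, \mathbf{y} \in \Omega} |\mathcal{E}(\mathbf{x}, \mathbf{y})| \leq \epsilon/2 + 2L\epsilon_1$ since the $\epsilon$-net ensures
that for all $\mathbf{x}, \mathbf{y} \in \Omega$, there exist $\mathbf{x}', \mathbf{y}' \in T$ such that
$\|\mathbf{x} - \mathbf{x}'\|, \|\mathbf{y} - \mathbf{y}'\| \leq \epsilon_1$. Thus choosing $\epsilon_1 = \frac{\epsilon}{4L}$ ensures
that $\sup_{\mathbf{x}, \mathbf{y} \in \Omega} |\langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle - K(\mathbf{x}, \mathbf{y})| \leq \epsilon$.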

Now ensuring that the feature maps provide a close approximation
to the kernel value at all pairs of points
taken from $T$ would cost us a reduction in the confidence
parameter by a factor of $|T|^2$ due to taking a
union bound. It is well known (for example see Cucker
and Smale 2001) that setting up an $\epsilon$-net at scale $\epsilon_1$
in $d$ dimensions over a compact set of diameter $\Delta$
takes at most $\left(\frac{4\Delta}{\epsilon_1}\right)^d$ centers. In our case $\Delta \leq 2R$
since $\Omega \subset B_1(\mathbf{0}, R) \subset B_2(\mathbf{0}, R)$ and $\epsilon_1 = \frac{\epsilon}{4L}$ i.e.
$|T| \leq \left(\frac{32RL}{\epsilon}\right)^d$.

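We now move on to the task of bounding the Lipschitz
constant of the error function. Since $\mathcal{E}(\cdot, \cdot)$ is symmetric
in both its arguments, it is sufficient to bound
$\|\nabla_{\mathbf{x}} \mathcal{E}(\mathbf{x}, \mathbf{y})\| \leq \|\nabla_{\mathbf{x}} \langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle\| + \|\nabla_{\mathbf{x}} K(\mathbf{x}, \mathbf{y})\|$. We
will bound these two quantities separately below.

Lemma 10. We have the following:

$$\sup_{\mathbf{x}, \mathbf{y} \in \Omega} \|\nabla_{\mathbf{x}} K(\mathbf{x}, \mathbf{y})\| \leq R f'(R^2)$$
$$\sup_{\mathbf{x}, \mathbf{y} \in \Omega} \|\nabla_{\mathbf{y}} K(\mathbf{x}, \mathbf{y})\| \leq R f'(R^2)$$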



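Proof. We have
$$\nabla_{\mathbf{x}} K(\mathbf{x}, \mathbf{y}) = \nabla_{\mathbf{x}} \left(\sum_{n=0}^{\infty} a_n \langle \mathbf{x}, \mathbf{y} \rangle^n\right) = \sum_{n=0}^{\infty} a_n \nabla_{\mathbf{x}} \langle \mathbf{x}, \mathbf{y} \rangle^n = \mathbf{y} \sum_{n=0}^{\infty} n a_n \langle \mathbf{x}, \mathbf{y} \rangle^{n-1}.$$
Thus we have
$$\|\nabla_{\mathbf{x}} K(\mathbf{x}, \mathbf{y})\| = \left\|\mathbf{y} \sum_{n=0}^{\infty} n a_n \langle \mathbf{x}, \mathbf{y} \rangle^{n-1}\right\| \leq R \sum_{n=0}^{\infty} n a_n |\langle \mathbf{x}, \mathbf{y} \rangle|^{n-1} \leq R \sum_{n=0}^{\infty} n a_n (R^2)^{n-1} = R f'(R^2)$$
where in the second and the third step we have used
the fact that $\mathbf{x}, \mathbf{y} \in \Omega \subset B_1(\mathbf{0}, R) \subset B_2(\mathbf{0}, R)$. Similarly
we can show $\sup_{\mathbf{x}, \mathbf{y} \in \Omega} \|\nabla_{\mathbf{y}} K(\mathbf{x}, \mathbf{y})\| \leq R f'(R^2)$. □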

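Lemma 11. We have the following:

$$\sup_{\mathbf{x}, \mathbf{y} \in \Omega} \|\nabla_{\mathbf{x}} (Z_1(\mathbf{x}) Z_1(\mathbf{y}))\| \leq p^2 R \sqrt{d} f'(pR^2)$$
$$\sup_{\mathbf{x}, \mathbf{y} \in \Omega} \|\nabla_{\mathbf{y}} (Z_1(\mathbf{x}) Z_1(\mathbf{y}))\| \leq p^2 R \sqrt{d} f'(pR^2)$$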

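Proof. Since $\langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle = \frac{1}{D} \sum_{i=1}^{D} Z_i(\mathbf{x}) Z_i(\mathbf{y})$ and
$\nabla_{\mathbf{x}} \langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle = \frac{1}{D} \sum_{i=1}^{D} \nabla_{\mathbf{x}} (Z_i(\mathbf{x}) Z_i(\mathbf{y}))$ we have
$\|\nabla_{\mathbf{x}} \langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle\| \leq \frac{1}{D} \sum_{i=1}^{D} \|\nabla_{\mathbf{x}} (Z_i(\mathbf{x}) Z_i(\mathbf{y}))\|$ by triangle
inequality. Since all the $Z_i$ feature maps
are identical it would be sufficient to bound
$\|\nabla_{\mathbf{x}} (Z_1(\mathbf{x}) Z_1(\mathbf{y}))\|$ and by the above calculation, the
same bound would hold for $\|\nabla_{\mathbf{x}} \langle Z(\mathbf{x}), Z(\mathbf{y}) \rangle\|$ as well.

Let $Z_1 : \mathbf{x} \mapsto \sqrt{a_N p^{N+1}} \prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{x}$ for some $N$.
Thus we can write the quantity $\nabla_{\mathbf{x}} (Z_1(\mathbf{x}) Z_1(\mathbf{y}))$
as $\nabla_{\mathbf{x}} \left(a_N p^{N+1} \prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{x} \prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{y}\right)$ which simplifies
to $a_N p^{N+1} \left(\prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{y}\right) \nabla_{\mathbf{x}} \left(\prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{x}\right)$ and further to
$a_N p^{N+1} \left(\prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{y}\right) \sum_{j=1}^{N} \boldsymbol{\omega}_j \left(\prod_{i \neq j} \boldsymbol{\omega}_i^\top \mathbf{x}\right)$.

We note that for any $\boldsymbol{\omega}$ chosen, $\|\boldsymbol{\omega}\| = \sqrt{d}$. Moreover,
as we have seen before, for any $\boldsymbol{\omega}$, $\sup_{\mathbf{x} \in \Omega} |\boldsymbol{\omega}^\top \mathbf{x}| \leq R$
by Hölder's inequality. Thus we can bound
$\|\nabla_{\mathbf{x}} (Z_1(\mathbf{x}) Z_1(\mathbf{y}))\|$ as

$$\left\|a_N p^{N+1} \left(\prod_{j=1}^{N} \boldsymbol{\omega}_j^\top \mathbf{y}\right) \sum_{j=1}^{N} \boldsymbol{\omega}_j \left(\prod_{i \neq j} \boldsymbol{\omega}_i^\top \mathbf{x}\right)\right\|$$
$$= a_N p^{N+1} \prod_{j=1}^{N} |\boldsymbol{\omega}_j^\top \mathbf{y}| \cdot \left\|\sum_{j=1}^{N} \boldsymbol{\omega}_j \left(\prod_{i \neq j} \boldsymbol{\omega}_i^\top \mathbf{x}\right)\right\|$$
$$\leq a_N p^{N+1} \prod_{j=1}^{N} |\boldsymbol{\omega}_j^\top \mathbf{y}| \sum_{j=1}^{N} \left(\prod_{i \neq j} |\boldsymbol{\omega}_i^\top \mathbf{x}|\right) \|\boldsymbol{\omega}_j\|$$
$$\leq a_N p^{N+1} R^N \sum_{j=1}^{N} R^{N-1} \sqrt{d}$$
$$\leq p^2 R \sqrt{d} \sum_{n=0}^{\infty} n a_n (pR^2)^{n-1} = p^2 R \sqrt{d} f'(pR^2)$$

where we have used the triangle inequality
in the third step. Similarly we can show
$\sup_{\mathbf{x}, \mathbf{y} \in \Omega} \|\nabla_{\mathbf{y}} (Z_1(\mathbf{x}) Z_1(\mathbf{y}))\| \leq p^2 R \sqrt{d} f'(pR^2)$. □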

Thus we have L = sup ||V x £(x,y)|| < Rf'(R 2 ) + 

x.yeii 

p 2 R\fdf'{pR 2 ). Putting all the results together, we 
first have by application of union bound that the 
probability that the feature map will fail at any 
pair of points chosen from the e-net is bounded by 
2 ( 32 ^ L ) 2d exp ^— J^t) ■ The covering argument along 
with the bound on the Lipschitz constant of the error 
function ensure that with the same confidence, the fea- 
ture map would provide an e-accurate estimate on the 
entire domain tt. Thus we have the following theorem. 
Theorem 12. Let $\Omega \subseteq B_1(0, R)$ be a compact subset of $\mathbb{R}^d$ and $K(x,y) = f(\langle x, y \rangle)$ be a dot product kernel defined on $\Omega$. Then, for the feature map $Z$ defined in Algorithm 1, we have

$$\mathbb{P} \left[ \sup_{x,y \in \Omega} \left| \langle Z(x), Z(y) \rangle - K(x,y) \right| > \epsilon \right] \le 2 \left( \frac{32RL}{\epsilon} \right)^{2d} \exp \left( -\frac{D\epsilon^2}{8C_\Omega^2} \right)$$

where $C_\Omega = p f(pR^2)$ and $L = R f'(R^2) + p^2 R \sqrt{d}\, f'(pR^2)$ for some small constant $p > 1$. Moreover, with $D = \Omega \left( \frac{dC_\Omega^2}{\epsilon^2} \log \frac{RL}{\epsilon\delta} \right)$ one can ensure the same with probability greater than $1 - \delta$.
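To get a feel for the magnitudes involved, the sketch below evaluates the constants $C_\Omega$ and $L$ and solves for the number of features $D$ that drives the failure bound below a target $\delta$. It assumes the bound has the form $2(32RL/\epsilon)^{2d}\exp(-D\epsilon^2/(8C_\Omega^2))$, and uses $f = \exp$, $R = 1$, $d = 10$ and $p = 2$ purely as illustrative inputs; the function names are hypothetical.

```python
import math


def theorem12_constants(f, f_prime, R, d, p):
    """C_Omega = p f(p R^2) and L = R f'(R^2) + p^2 R sqrt(d) f'(p R^2)."""
    C = p * f(p * R ** 2)
    L = R * f_prime(R ** 2) + p ** 2 * R * math.sqrt(d) * f_prime(p * R ** 2)
    return C, L


def log_failure_bound(D, eps, R, d, C, L):
    """Logarithm of 2 (32 R L / eps)^(2d) exp(-D eps^2 / (8 C^2))."""
    return (math.log(2.0) + 2 * d * math.log(32.0 * R * L / eps)
            - D * eps ** 2 / (8.0 * C ** 2))


def features_needed(eps, delta, R, d, C, L):
    """Smallest integer D for which the assumed bound is at most delta."""
    log_terms = (math.log(2.0) + 2 * d * math.log(32.0 * R * L / eps)
                 + math.log(1.0 / delta))
    return max(1, math.ceil(8.0 * C ** 2 * log_terms / eps ** 2))


C, L = theorem12_constants(math.exp, math.exp, R=1.0, d=10, p=2.0)
D = features_needed(eps=0.5, delta=0.05, R=1.0, d=10, C=C, L=L)
print(f"C_Omega = {C:.3f}, L = {L:.3f}, D = {D}")
```

Working with the logarithm of the bound avoids overflow in the $(32RL/\epsilon)^{2d}$ factor when $d$ is large.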

The behavior of this bound with respect to the dimensionality of the input space, the accuracy parameter and the confidence parameter is of the form $D = \Omega \left( \frac{d}{\epsilon^2} \log \frac{1}{\epsilon\delta} \right)$ that matches that of Rahimi and Recht (2007). The bound has a stronger dependence on kernel specific parameters which appear as non-logarithmic terms due to the unbounded nature of the dot product kernels. Even so, the kernel specific term $C_\Omega$ is dependent on the largest value taken by the kernel in the domain $\Omega$, a dependence that is unavoidable for an algorithm giving guarantees on the absolute (rather than relative) deviation from the true value.

4.2 An Alternative Feature Map

An alternative method to bounding the amount of randomness being used is to truncate the Maclaurin series after a certain number of terms and use the resulting function to define a new kernel. Since the Maclaurin series of an analytic function defined over a bounded domain converges to it uniformly, we can truncate the series while incurring a uniformly bounded error. A similar approach is used in Vedaldi and Zisserman (2010) to present deterministic feature maps.

Suppose we have a positive definite dot product kernel $K$ defined on a domain $\Omega \subseteq B_1(0, R)$ in some Euclidean space $\mathbb{R}^d$ by a function $f(x) = \sum_{n=0}^{\infty} a_n x^n$. If we choose $k = k(\epsilon, R)$ such that $\sum_{n=0}^{k} a_n R^{2n} \ge f(R^2) - \epsilon$ (or select some set $S \subset \mathbb{N} \cup \{0\}$ such that $\sum_{n \in S} a_n R^{2n} \ge f(R^2) - \epsilon$ and $|S| = k$) and create a new kernel $\tilde{K}(x,y) = \sum_{n=0}^{k} a_n \langle x, y \rangle^n$, then the residual error satisfies

$$R_k = \sup_{x,y \in \Omega} \left| K(x,y) - \tilde{K}(x,y) \right| = \sup_{x,y \in \Omega} \left| \sum_{n=k+1}^{\infty} a_n \langle x, y \rangle^n \right| \le \sum_{n=k+1}^{\infty} a_n R^{2n} \le \epsilon$$

since $\Omega \subseteq B_1(0, R) \subseteq B_2(0, R)$ and $\sum_{n=0}^{\infty} a_n R^{2n} = f(R^2)$.
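The choice of the truncation order $k$ described above can be sketched as a simple partial-sum search. The example assumes the exponential function $f(t) = e^t$ with $a_n = 1/n!$, $R = 1$ and $\epsilon = 10^{-3}$; the function names are illustrative only.

```python
import math


def truncation_order(coeff, f_at_R2, R, eps, max_terms=1000):
    """Smallest k with sum_{n=0}^k a_n R^(2n) >= f(R^2) - eps."""
    partial = 0.0
    for k in range(max_terms):
        partial += coeff(k) * R ** (2 * k)
        if partial >= f_at_R2 - eps:
            return k
    raise ValueError("series did not reach f(R^2) - eps within max_terms")


def truncated_kernel(coeff, k, x, y):
    """Truncated kernel sum_{n=0}^k a_n <x, y>^n."""
    t = sum(xi * yi for xi, yi in zip(x, y))
    return sum(coeff(n) * t ** n for n in range(k + 1))


def exp_coeff(n):
    return 1.0 / math.factorial(n)


R, eps = 1.0, 1e-3
k = truncation_order(exp_coeff, math.exp(R ** 2), R, eps)

x, y = [0.5, 0.5], [0.3, -0.7]          # both points have l1 norm 1
exact = math.exp(sum(a * b for a, b in zip(x, y)))
residual = abs(exact - truncated_kernel(exp_coeff, k, x, y))
print(f"k = {k}, residual at (x, y) = {residual:.2e}")
```

Because the coefficients are non-negative, the residual at any pair of points in the domain is dominated by the tail of the series evaluated at $R^2$, which is what the search controls.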



Thus for all $x, y \in \Omega$, $K(x,y) - \epsilon \le \tilde{K}(x,y) \le K(x,y) + \epsilon$. Since $\tilde{K}$ also satisfies the conditions of Corollary JT] one can now obtain $\epsilon_1$-accurate feature maps for $\tilde{K}$ using the techniques mentioned above and those feature maps would provide an $(\epsilon + \epsilon_1)$-accurate estimate to $K$.

5 Generalizing to Compositional Kernels

Given a positive definite dot product kernel $K_{dp}$ and an arbitrary positive definite kernel $K$, the kernel $K_{co}$ defined as $K_{co}(x,y) = K_{dp}(K(x,y))$ is also positive definite. This fact can be deduced either by directly invoking a result due to FitzGerald et al. (1995, Theorem 2.1) or by applying Schoenberg's result in conjunction with Mercer's theorem. We now show how to extend the result for dot product kernels to such compositional kernels.

Note that plugging a translation invariant kernel into 
a dot product kernel yields yet another translation in- 
variant kernel since the set of translation invariant ker- 
nels is closed under powering, scalar multiplication and 
addition. However, a set of homogeneous kernels not 
sharing the homogeneity parameter is not closed un- 
der addition. Hence the set of homogeneous kernels is 
not closed under the operations mentioned above and 
thus, plugging a homogeneous kernel into a dot prod- 
uct kernel in general yields a novel non-homogeneous 
kernel. We also note that the results obtained in the 
section above can be now viewed as special cases of the 
result presented in this section with the dot product 
being substituted into a dot product kernel. 

In order to construct feature maps for the composi- 
tional kernel we assume that we have black-box ac- 
cess to a (possibly randomized) feature map selection 
routine A which when invoked, returns a feature map 
$W : \mathbb{R}^d \to \mathbb{R}$ for $K$. If we assume that the kernel $K$
is bounded and Lipschitz and that the feature map W 
returned to us is bounded, Lipschitz on expectation 
and provides an unbiased estimate of K, then one can 
design (using these feature maps for K) feature maps 
for K co . The analysis of the final feature map in this 
case is a bit more involved since we only assume black- 
box access to A and only expect the feature map to 
be Lipschitz on expectation. 

We first formally state the assumptions made about the kernel $K$ and the feature maps returned by $\mathcal{A}$:

1. $K$ is defined over some domain $\Omega \subseteq \mathbb{R}^d$.

2. $K$ is bounded i.e. we have $\sup_{x,y \in \Omega} |K(x,y)| \le C_K$ for some $C_K \in \mathbb{R}^+$.

3. $K$ is Lipschitz i.e. we have $\sup_{x,y \in \Omega} \|\nabla_x K(x,y)\| \le L_K$ and $\sup_{x,y \in \Omega} \|\nabla_y K(x,y)\| \le L_K$ for some $L_K \in \mathbb{R}^+$.

4. $W$ is an unbiased estimator of $K$ i.e. for all $x, y \in \Omega$, $\mathbb{E}[W(x)W(y)] = K(x,y)$ where the expectation is over the internal randomness of $W$.

5. $W$ is a bounded feature map i.e. there exists some $C_W \in \mathbb{R}^+$ such that $\sup_{x \in \Omega} |W(x)| \le \sqrt{C_W}$.

6. $W$ is Lipschitz on expectation i.e. for some $L_W \in \mathbb{R}^+$, $\sup_{x \in \Omega} \mathbb{E}\left[\|\nabla_x W(x)\|\right] \le L_W$.

Algorithm 2 Random Maclaurin Feature Maps for Compositional Kernels

Require: A compositional positive definite kernel $K_{co}(x,y) = K_{dp}(K(x,y)) = f(K(x,y))$.
Ensure: A randomized feature map $Z : \mathbb{R}^d \to \mathbb{R}^D$ such that $\langle Z(x), Z(y) \rangle \approx K_{co}(x,y)$.
  Obtain the Maclaurin expansion of $f(x) = \sum_{n=0}^{\infty} a_n x^n$ by setting $a_n = \frac{f^{(n)}(0)}{n!}$.
  Fix a value $p > 1$.
  for $i = 1$ to $D$ do
    Choose a non negative integer $N \in \mathbb{N} \cup \{0\}$ with $\mathbb{P}[N = n] = \frac{1}{p^{n+1}}$.
    Get $N$ independent instantiations of the feature map for $K$ from $\mathcal{A}$ as $W_1, \ldots, W_N$.
    Let feature map $Z_i : x \mapsto \sqrt{a_N p^{N+1}} \prod_{j=1}^{N} W_j(x)$.
  end for
  Output $Z : x \mapsto \frac{1}{\sqrt{D}} \left( Z_1(x), \ldots, Z_D(x) \right)$.
Our feature map construction algorithm is similar to the one used for dot product kernels. We pick a non-negative integer $N \in \mathbb{N} \cup \{0\}$ with $\mathbb{P}[N = n] = \frac{1}{p^{n+1}}$ for some fixed $p > 1$ and output the feature map $Z : \mathbb{R}^d \to \mathbb{R}$, $Z : x \mapsto \sqrt{a_N p^{N+1}} \prod_{j=1}^{N} W_j(x)$ where $W_1, \ldots, W_N$ are independent instantiations of the feature map $W$ associated with the kernel $K$. We concatenate $D$ such feature maps to give our final feature map.
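The construction just described can be sketched in a few lines. The example below is an illustration under stated assumptions rather than the authors' implementation: the black-box routine is played by a random Fourier feature for the Gaussian kernel (a standard unbiased, bounded estimator with $|W(x)| \le \sqrt{2}$), the outer function is $f = \exp$ with $a_n = 1/n!$, and $p = 2$ so that the probabilities $2^{-(n+1)}$ sum to one. All function names are hypothetical.

```python
import math
import random


def sample_fourier_feature(rng, d, sigma):
    """Assumed oracle: one random Fourier feature W for the Gaussian kernel.

    E[W(x) W(y)] = exp(-||x - y||^2 / (2 sigma^2)) and |W(x)| <= sqrt(2).
    """
    omega = [rng.gauss(0.0, 1.0 / sigma) for _ in range(d)]
    b = rng.uniform(0.0, 2.0 * math.pi)

    def W(x):
        return math.sqrt(2.0) * math.cos(sum(o * xi for o, xi in zip(omega, x)) + b)

    return W


def sample_order(rng):
    """Draw N with P[N = n] = 2^-(n+1), i.e. the case p = 2."""
    n = 0
    while rng.random() < 0.5:
        n += 1
    return n


def build_feature_map(rng, D, coeff, oracle):
    """Concatenate D maps x -> sqrt(a_N 2^(N+1)) prod_j W_j(x), scaled by 1/sqrt(D)."""
    parts = []
    for _ in range(D):
        N = sample_order(rng)
        ws = [oracle(rng) for _ in range(N)]
        parts.append((math.sqrt(coeff(N) * 2.0 ** (N + 1)), ws))

    def Z(x):
        return [s * math.prod(w(x) for w in ws) / math.sqrt(D) for s, ws in parts]

    return Z


rng = random.Random(1)
d, sigma, D = 3, 1.0, 4000
Z = build_feature_map(
    rng, D,
    coeff=lambda n: 1.0 / math.factorial(n),              # f = exp
    oracle=lambda r: sample_fourier_feature(r, d, sigma),
)

x, y = [0.2, -0.1, 0.4], [0.0, 0.3, 0.1]
sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
exact = math.exp(math.exp(-sq_dist / (2.0 * sigma ** 2)))   # K_co = exp(K)
estimate = sum(zx * zy for zx, zy in zip(Z(x), Z(y)))
print(f"exact = {exact:.4f}, estimate = {estimate:.4f}")
```

Any other oracle satisfying the assumptions listed for $W$ could be substituted for `sample_fourier_feature` without changing `build_feature_map`.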

It is clear that on expectation, the product of the feature map values is equal to the value of the kernel i.e. $\mathbb{E}_{N, W_1, \ldots, W_N} \left[ \langle Z(x), Z(y) \rangle \right] = K_{co}(x,y)$ where $Z : x \mapsto \frac{1}{\sqrt{D}} \left( Z_1(x), \ldots, Z_D(x) \right)$. Yet again we expect that the concatenation of $D$ such feature maps for a large enough $D$ would provide us a close approximation to $K_{co}$ with high probability. For this we first prove that our feature map is bounded.

Lemma 13. For all $x, y \in \Omega$, $|Z(x)Z(y)| \le p f(pC_W)$.

Proof. $Z(x)Z(y) = a_N p^{N+1} \prod_{j=1}^{N} W_j(x) \prod_{j=1}^{N} W_j(y)$. Using the bound on the feature maps we get the inequality $|Z(x)Z(y)| \le a_N p^{N+1} C_W^N \le p f(pC_W)$. □

Thus we have for any $x, y \in \Omega$, $\left| \langle Z(x), Z(y) \rangle - K_{co}(x,y) \right| \le \epsilon$ with probability at least $1 - 2\exp \left( -\frac{D\epsilon^2}{8C_1^2} \right)$ where $C_1 = p f(pC_W)$. We now investigate the Lipschitz properties of $K_{co}$ and our feature map.

Lemma 14. We have

$$\sup_{x,y \in \Omega} \left\| \nabla_x K_{co}(x,y) \right\| \le L_K f'(C_K)$$

$$\sup_{x,y \in \Omega} \left\| \nabla_y K_{co}(x,y) \right\| \le L_K f'(C_K)$$

Proof. $K_{co}(x,y) = \sum_{n=0}^{\infty} a_n K(x,y)^n$. Thus we have by linearity $\nabla_x K_{co}(x,y) = \sum_{n=0}^{\infty} a_n \nabla_x \left( K(x,y)^n \right) = \sum_{n=0}^{\infty} n a_n K(x,y)^{n-1} \nabla_x K(x,y)$ i.e. $\left\| \nabla_x K_{co}(x,y) \right\| \le \left\| \nabla_x K(x,y) \right\| \sum_{n=0}^{\infty} n a_n C_K^{n-1} \le L_K f'(C_K)$. Similarly we have $\sup_{x,y \in \Omega} \left\| \nabla_y K_{co}(x,y) \right\| \le L_K f'(C_K)$. □

We next move on to the Lipschitz properties of $Z$. Since we have only made assumptions on the expected Lipschitz properties of $W$, we would only be able to give guarantees on the expected Lipschitz properties of $Z$. However, as we shall see, these would be sufficient to provide a uniform convergence guarantee over the entire domain $\Omega$. As before, we find that by linearity of expectation, analyzing the expected Lipschitz properties of a single feature map $Z$ is sufficient to guarantee, on expectation, similar properties for the concatenated feature map as well.

Lemma 15. We have

$$\sup_{x,y \in \Omega} \left\| \nabla_x \left( Z(x)Z(y) \right) \right\| \le L_W p^2 \sqrt{C_W}\, f'(pC_W)$$

$$\sup_{x,y \in \Omega} \left\| \nabla_y \left( Z(x)Z(y) \right) \right\| \le L_W p^2 \sqrt{C_W}\, f'(pC_W)$$

Proof. Since $Z(x)Z(y) = a_N p^{N+1} \prod_{j=1}^{N} W_j(x) W_j(y)$, by linearity we can write $\nabla_x Z(x)Z(y) = \left( a_N p^{N+1} \prod_{j=1}^{N} W_j(y) \right) \sum_{j=1}^{N} \left( \prod_{i \ne j} W_i(x) \right) \nabla_x W_j(x)$. Thus we can then bound $\left\| \nabla_x Z(x)Z(y) \right\|$ as

$$\left\| a_N p^{N+1} \left( \prod_{j=1}^{N} W_j(y) \right) \sum_{j=1}^{N} \left( \prod_{i \ne j} W_i(x) \right) \nabla_x W_j(x) \right\| \le a_N p^{N+1} C_W^{N/2} \sum_{j=1}^{N} C_W^{(N-1)/2} \left\| \nabla_x W_j(x) \right\|$$

which gives us, by linearity of expectation and the bound on the expected Lipschitz properties of the individual estimators,

$$\begin{aligned}
\mathbb{E} \left[ \left\| \nabla_x Z(x)Z(y) \right\| \right] &\le N a_N p^{N+1} C_W^{N - 1/2} L_W \\
&= L_W p^2 \sqrt{C_W} \cdot N a_N (pC_W)^{N-1} \\
&\le L_W p^2 \sqrt{C_W}\, f'(pC_W)
\end{aligned}$$

Similarly we have $\sup_{x,y \in \Omega} \left\| \nabla_y \left( Z(x)Z(y) \right) \right\| \le L_W p^2 \sqrt{C_W}\, f'(pC_W)$. □



Working as before we find that the error function $\mathcal{E}(x,y) = \langle Z(x), Z(y) \rangle - K_{co}(x,y)$ is, on expectation, $L_1$-Lipschitz for $L_1 = L_K f'(C_K) + L_W p^2 \sqrt{C_W}\, f'(pC_W)$. Hence the probability that the error function will not be $\frac{\epsilon}{2r}$-Lipschitz is less than $\frac{2L_1 r}{\epsilon}$ by an application of Markov's inequality. If, however, the error function is $\frac{\epsilon}{2r}$-Lipschitz, then constructing an $\epsilon$-net at scale $r$ over the domain $\Omega$ and ensuring that the estimator provides an $\epsilon/2$-approximation at the centers of this net would ensure an $\epsilon$-accurate estimation to the kernel on the entire domain $\Omega$. Setting up such a net would require at most $\left( \frac{4R}{r} \right)^d$ centers if $\Omega \subseteq B_1(0, R)$. Adding the failure probabilities of the estimator not being accurate on the $\epsilon$-net centers to the probability of the error function not being Lipschitz gives us the total error probability of our estimator giving an inaccurate estimate over any point in the domain as

$$2 \left( \frac{4R}{r} \right)^d \exp \left( -\frac{D\epsilon^2}{8C_1^2} \right) + \frac{2L_1 r}{\epsilon}$$

Looking at this quantity as of the form $\kappa_1 r^{-d} + \kappa_2 r$ and setting $r = \left( \frac{\kappa_1}{\kappa_2} \right)^{\frac{1}{d+1}}$ gives us the error probability as $2 \kappa_1^{\frac{1}{d+1}} \kappa_2^{\frac{d}{d+1}} \le \left( \frac{32RL_1}{\epsilon} \right) \exp \left( -\frac{D\epsilon^2}{8C_1^2(d+1)} \right)$ if $\epsilon < 8RL_1$ which gives us the following theorem.

Theorem 16. Let $\Omega \subseteq B_1(0, R)$ be a compact subset of $\mathbb{R}^d$ and $K_{co}(x,y) = K_{dp}(K(x,y))$ be a compositional kernel defined on $\Omega$ satisfying the necessary boundedness and Lipschitz conditions. Assuming we have black-box access to a feature map selection algorithm for $K$ also satisfying the necessary boundedness and Lipschitz conditions, for the feature map $Z$ defined in Algorithm 2, we have

$$\mathbb{P} \left[ \sup_{x,y \in \Omega} \left| \langle Z(x), Z(y) \rangle - K_{co}(x,y) \right| > \epsilon \right] \le \left( \frac{32RL_1}{\epsilon} \right) \exp \left( -\frac{D\epsilon^2}{8C_1^2(d+1)} \right)$$

where $C_1 = p f(pC_W)$ and $L_1 = L_K f'(C_K) + L_W p^2 \sqrt{C_W}\, f'(pC_W)$ for some small constant $p > 1$. Moreover, with $D = \Omega \left( \frac{dC_1^2}{\epsilon^2} \log \frac{RL_1}{\epsilon\delta} \right)$, one can ensure the same with probability greater than $1 - \delta$.
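The balancing step used to arrive at this bound, namely treating the failure probability as $\kappa_1 r^{-d} + \kappa_2 r$ and choosing $r = (\kappa_1/\kappa_2)^{1/(d+1)}$, can be checked numerically. In the sketch below the values of $\kappa_1$, $\kappa_2$ and $d$ are arbitrary illustrative numbers, not quantities from the paper.

```python
import math


def balanced_scale(kappa1, kappa2, d):
    """r = (kappa1 / kappa2)^(1/(d+1)), which makes kappa1 r^-d equal kappa2 r."""
    return (kappa1 / kappa2) ** (1.0 / (d + 1))


def failure_probability(r, kappa1, kappa2, d):
    """Total failure probability kappa1 r^-d + kappa2 r at net scale r."""
    return kappa1 * r ** (-d) + kappa2 * r


kappa1, kappa2, d = 1e-6, 50.0, 10        # arbitrary illustrative values
r = balanced_scale(kappa1, kappa2, d)
total = failure_probability(r, kappa1, kappa2, d)
closed_form = 2.0 * kappa1 ** (1.0 / (d + 1)) * kappa2 ** (d / (d + 1))
print(f"r = {r:.4f}, total = {total:.4f}, closed form = {closed_form:.4f}")
```

This choice of $r$ equalises the two terms, which is what yields the closed form $2\kappa_1^{1/(d+1)}\kappa_2^{d/(d+1)}$; it is a convenient choice rather than the exact minimiser of the sum.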



Yet again the dependence on input space parameters 
is similar to that in the case of dot product kernel 
feature maps. The only non-logarithmic kernel specific 
dependence is on $C_1$ which encodes the largest possible
value taken by the oracle features which is related to
the range of values taken by the kernel $K$.

6 Experiments 

In this section we report results of our feature map construction algorithm on both toy as well as benchmark datasets. In the following, homogeneous kernel refers to the kernel $K_h(x,y) = \langle x, y \rangle^p$, polynomial kernel refers to $K_p(x,y) = (1 + \langle x, y \rangle)^p$ and exponential kernel refers to $K_e(x,y) = \exp \left( \frac{\langle x, y \rangle}{\sigma^2} \right)$. In all our experiments we used $p = 10$ and set the value of the "width" parameter $\sigma$ to be the mean of all pairwise training data distances, a standard heuristic. We shall denote by $d$ the dimensionality of the original feature space and by $D$ the number of random feature maps used. Before we move on, we describe a heuristic which, when used in conjunction with random feature maps, gives attractive results allowing for accelerated training and testing times for the SVM algorithm.

6.1 The Heuristic H0/1

Consider a dot product kernel defined by $K(x,y) = \sum_{n=0}^{\infty} a_n \langle x, y \rangle^n$. This heuristic simply makes an observation that the first two terms of this expansion need not be estimated at all. The first term, being a constant, can be absorbed into the offset parameter of SVM formulations and the second term can be handled by simply adjoining the random features with the original features. This allows us to use all our randomness in estimating higher order terms. We refer to algorithmic formulations that use this heuristic as H0/1 and those that use only random features as RF.
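One way to realise this heuristic is sketched below. The text does not spell out how the remaining randomness is distributed, so the sketch makes its own assumptions: the order is drawn with $\mathbb{P}[N = n] = 2^{-(n-1)}$ for $n \ge 2$, Rademacher vectors are used, the original features are scaled by $\sqrt{a_1}$, and the kernel is $\exp(\langle x, y \rangle)$. All names are hypothetical.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def build_h01_map(rng, d, D, coeff):
    """Random Maclaurin features for the terms n >= 2, adjoined to sqrt(a_1) x.

    Assumed sampling scheme: P[N = n] = 2^-(n-1) for n >= 2, so that
    E[<Z(x), Z(y)>] = sum_{n >= 2} a_n <x, y>^n for the random part Z.
    """
    parts = []
    for _ in range(D):
        N = 2
        while rng.random() < 0.5:
            N += 1
        omegas = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(N)]
        parts.append((math.sqrt(coeff(N) * 2.0 ** (N - 1)), omegas))

    def features(x):
        linear = [math.sqrt(coeff(1)) * xi for xi in x]
        rand = [s * math.prod(dot(w, x) for w in omegas) / math.sqrt(D)
                for s, omegas in parts]
        return linear + rand              # (d + D)-dimensional vector

    return features


def exp_coeff(n):
    return 1.0 / math.factorial(n)


rng = random.Random(2)
d, D = 4, 2000
phi = build_h01_map(rng, d, D, exp_coeff)

x, y = [0.4, -0.3, 0.2, 0.1], [0.5, -0.2, 0.2, 0.1]    # l1 norm 1 each
estimate = exp_coeff(0) + dot(phi(x), phi(y))           # a_0 acts as the offset
exact = math.exp(dot(x, y))
print(f"exact = {exact:.4f}, H0/1 estimate = {estimate:.4f}")
```

In a linear SVM the constant $a_0$ would not be added explicitly as it is here; it would be absorbed by the learnt offset, which is why only the $(d + D)$-dimensional vector returned by `phi` is passed to the learner.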

We note some properties of this heuristic. First of all, as we shall see, H0/1 offers superior accuracies even when using a very small number of random features since we get away with an exact estimate of the leading terms in the Maclaurin expansion. However this is accompanied by two overheads. First of all this offers a small overhead while testing since the test vectors are $(d + D)$-dimensional instead of $D$-dimensional if we were to use only random features (as is the case with RF).

A more subtle overhead comes at feature map application time since the use of H0/1 implies that, on an average, each of the $D$ feature maps is estimating a higher order term (as compared to RF) which requires more randomness. Moreover, as it takes longer for feature maps estimating higher order terms to be applied (see Algorithm 1), this results in longer feature construction times. Hence, after $D$ is chosen beyond a certain threshold, the benefits offered by H0/1 are overshadowed by the longer feature construction times and plain RF becomes preferable in terms of lower test times. However, as the experiments will indicate, H0/1 is an attractive option for ultra fast learning routines for small to moderate values of $D$ which, although they do not increase feature construction time too much, offer much better classification accuracies than RF.

[Figure 1, panels: (a) Homogeneous kernel, (b) Polynomial kernel, (c) Exponential kernel]

Figure 1: Error rates achieved by random feature maps on three dot product kernels. Plots of different colors represent various values of input dimension $d$. In Figures 1b and 1c thin plots represent non-H0/1 experiments and thick plots of same color represent results for the same value of input dimension $d$ but with H0/1.

(a) Polynomial Kernel, $K(x,y) = (1 + \langle x, y \rangle)^{10}$

Dataset | K + LIBSVM | RF + LIBLINEAR | H0/1 + LIBLINEAR
Nursery (N = 13000, d = 8) | acc = 99.9%, trn = 18.6s, tst = 3.37s | acc = 99.7%, trn = 3.96s (4.7×), tst = 0.63s (5.3×), D = 500 | acc = 98.2%, trn = 0.49s (38×), tst = 0.1s (33×), D = 100
Spambase (N = 4600, d = 57) | acc = 93.8%, trn = 3.64s, tst = 2.84s | acc = 93.2%, trn = 1.67s (2.2×), tst = 1.13s (2.5×), D = 500 | acc = 92.02%, trn = 0.19s (19×), tst = 0.38s (7.5×), D = 50
Cod-RNA (N = 60000, d = 8) | acc = 95.2%, trn = 144.1s, tst = 28.6s | acc = 94.9%, trn = 12.1s (12×), tst = 2.8s (10×), D = 500 | acc = 93.77%, trn = 0.63s (229×), tst = 0.51s (56×), D = 50
Adult (N = 49000, d = 123) | acc = 84.2%, trn = 179.6s, tst = 60.6s | acc = 84.7%, trn = 21.2s (8.5×), tst = 15.6s (3.9×), D = 500 | acc = 84.7%, trn = 6.9s (26×), tst = 7.26s (8.4×), D = 100
IJCNN (N = 141000, d = 22) | acc = 98.4%, trn = 164.1s, tst = 33.4s | acc = 97.3%, trn = 36.5s (4.5×), tst = 23.3s (1.4×), D = 1000 | acc = 92.3%, trn = 4.98s (33×), tst = 7.5s (4.5×), D = 200
Covertype (N = 581000, d = 54) | acc = 77.4%, trn = 160.95s, tst = 1653.9s | acc = 77.04%, trn = 186.1s (-), tst = 236.8s (7×), D = 1000 | acc = 75.5%, trn = 3.9s (41×), tst = 70.3s (83×), D = 100

(b) Exponential Kernel, $K(x,y) = \exp \left( \frac{\langle x, y \rangle}{\sigma^2} \right)$

Dataset | K + LIBSVM | RF + LIBLINEAR | H0/1 + LIBLINEAR
Nursery (N = 13000, d = 8) | acc = 99.8%, trn = 10.8s, tst = 1.7s | acc = 99.6%, trn = 2.52s (4.3×), tst = 0.6s (2.8×), D = 500 | acc = 97.96%, trn = 0.4s (27×), tst = 0.18s (9.4×), D = 100
Spambase (N = 4600, d = 57) | acc = 93.5%, trn = 3.19s, tst = 1.89s | acc = 92.3%, trn = 1.9s (1.7×), tst = 0.6s (3.1×), D = 500 | acc = 92.08%, trn = 0.19s (17×), tst = 0.16s (74×), D = 50
Cod-RNA (N = 60000, d = 8) | acc = 95.2%, trn = 91.5s, tst = 17.1s | acc = 94.9%, trn = 11.5s (8×), tst = 2.8s (6.1×), D = 500 | acc = 93.8%, trn = 0.67s (136×), tst = 1.4s (12×), D = 50
Adult (N = 49000, d = 123) | acc = 83.7%, trn = 263.3s, tst = 33.4s | acc = 82.9%, trn = 39.8s (6.6×), tst = 14.3s (2.3×), D = 500 | acc = 84.8%, trn = 7.18s (37×), tst = 9.4s (3.6×), D = 100
IJCNN (N = 141000, d = 22) | acc = 98.4%, trn = 135.8s, tst = 29.98s | acc = 97.2%, trn = 24.9s (5.5×), tst = 23.4s (1.3×), D = 1000 | acc = 92.2%, trn = 5.2s (26×), tst = 9.1s (3.3×), D = 200
Covertype (N = 581000, d = 54) | acc = 80.6%, trn = 194.1s, tst = 695.8s | acc = 76.2%, trn = 21.4s (9×), tst = 207s (3.6×), D = 1000 | acc = 75.5%, trn = 3.7s (52×), tst = 80.4s (8.7×), D = 100

Table 1: RF, H0/1 and K denote respectively, the use of random features, H0/1 and actual kernel values. The first columns list the datasets, their sizes (N) and their dimensionalities (d). Subsequent columns list the number of random features used (D), classification accuracies (acc), training/testing times (trn/tst) and speedups (×).

6.2 Toy Experiments 

In our first experiment, we tested the accuracy of the feature maps on the three dot product kernels $K_h$, $K_p$ and $K_e$. We sampled 100 random points from the unit ball in $d$ dimensions (we used various values of $d$ between 10 and 200) and constructed feature maps for various values of $D$ from 10 to 5000. The error incurred by the feature maps was taken to be the average absolute difference between the entries of the kernel matrix as given by the dot product kernel and that given by the linear kernel on the new feature space given by the feature maps. The results of the experiments, averaged over 5 runs, are shown in Figure 1. One can see that in each case, the error quickly drops as we increase the value of $D$.
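A scaled-down version of this experiment can be sketched as follows. This is not the authors' experimental code: it assumes the exponential kernel with the width fixed to 1, $p = 2$, Rademacher projection vectors, 20 points drawn from the $\ell_1$ unit ball in $d = 5$ dimensions and a single run, all chosen to keep the example small.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def sample_l1_ball(rng, d, R=1.0):
    """Draw a point with l1 norm at most R (not uniform; enough for a demo)."""
    v = [rng.uniform(-1.0, 1.0) for _ in range(d)]
    scale = R * rng.random() / sum(abs(c) for c in v)
    return [scale * c for c in v]


def build_maclaurin_map(rng, d, D, coeff):
    """D random Maclaurin features with p = 2 and Rademacher vectors."""
    parts = []
    for _ in range(D):
        N = 0
        while rng.random() < 0.5:         # P[N = n] = 2^-(n+1)
            N += 1
        omegas = [[rng.choice((-1.0, 1.0)) for _ in range(d)] for _ in range(N)]
        parts.append((math.sqrt(coeff(N) * 2.0 ** (N + 1)), omegas))

    def Z(x):
        return [s * math.prod(dot(w, x) for w in omegas) / math.sqrt(D)
                for s, omegas in parts]

    return Z


def mean_abs_error(points, Z, kernel):
    """Average absolute difference between exact and approximate kernel matrices."""
    feats = [Z(x) for x in points]
    n = len(points)
    errs = [abs(dot(feats[i], feats[j]) - kernel(points[i], points[j]))
            for i in range(n) for j in range(n)]
    return sum(errs) / len(errs)


def exp_kernel(x, y):
    return math.exp(dot(x, y))


rng = random.Random(3)
d = 5
points = [sample_l1_ball(rng, d) for _ in range(20)]
errors = {}
for D in (10, 100, 1000, 5000):
    Z = build_maclaurin_map(rng, d, D, lambda n: 1.0 / math.factorial(n))
    errors[D] = mean_abs_error(points, Z, exp_kernel)
    print(D, round(errors[D], 4))
```

Averaging over several independent runs, as done in the experiment reported here, would smooth out the run-to-run fluctuations that a single draw of the feature map exhibits.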

We also experimented with the effect of H0/1 on these toy datasets for $K_p$ and $K_e$ ($K_h$ does not have terms corresponding to $n = 0, 1$ and hence H0/1 cannot be applied). For sake of clarity, the X-axis in all the graphs in Figure 1 represents only $D$ and not the final number of features used (which is $d + D$ for H0/1 experiments). Also, to avoid clutter, we have omitted plots for certain small values of $d$ in Figures 1b and 1c. Notice how in all cases, H0/1 registers a sharper drop in error than RF.

[Figure 2, panels: (a) Classification accuracies achieved by non-H0/1 (green) and H0/1 (red) routines on 4 datasets; (b) Training times (log-scale) achieved by non-H0/1 (magenta) and H0/1 (blue) routines on the same 4 datasets; (c) Testing times (log-scale) achieved by non-H0/1 (gray) and H0/1 (cyan) routines on the same 4 datasets]

Figure 2: Performance of H0/1 vs non-H0/1 on four datasets. The first column corresponds to experiments on the Spambase dataset with the polynomial kernel. The next three columns correspond to experiments on Nursery with the polynomial kernel, IJCNN with the exponential kernel and Cod-RNA with the exponential kernel.

We note that the error rates vary considerably across kernels. This is due to the difference in the range of values taken by these kernels. With the specified values of kernel parameters, whereas $K_h$ can only take values in the range $[-1, 1]$ inside $B_2(0, 1) \subset \mathbb{R}^d$, $K_p$ can take values up to 1024 and $K_e$ up to 2.73. One notices that the error rates offered by the feature maps also differ in much the same way for these kernels.

6.3 Experiments on UCI Datasets 

In our second experiment, we tested the performance of our feature map on benchmark datasets. In these experiments we used 60% of the data (subject to a maximum of 20000) for training and the rest as test data. Non-linear kernels were used along with LIBSVM (Chang and Lin, 2011) and random feature routines RF and H0/1 were used along with LIBLINEAR (Fan et al., 2008) for the classification tasks. Non-binary problems were binarized randomly for simplicity. Since the kernels being considered are unbounded, the lengths of all vectors were normalized using normalization constants learnt on the training sets. All results presented are averages across five random (but fixed) splits of the datasets.
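The data preparation described above can be sketched as below. The exact normalization constant is not specified in the text, so this example assumes it is the largest $\ell_2$ norm observed on the training split; the function name and the synthetic data are hypothetical.

```python
import math
import random


def split_and_normalise(data, rng, train_fraction=0.6, max_train=20000):
    """60/40 split (training capped at max_train) followed by length normalisation.

    Assumption: the normalisation constant is the largest l2 norm seen on the
    training split; it is then applied unchanged to the test split.
    """
    idx = list(range(len(data)))
    rng.shuffle(idx)
    n_train = min(int(round(train_fraction * len(data))), max_train)
    train = [data[i] for i in idx[:n_train]]
    test = [data[i] for i in idx[n_train:]]
    scale = max(math.sqrt(sum(c * c for c in x)) for x in train) or 1.0

    def rescale(rows):
        return [[c / scale for c in x] for x in rows]

    return rescale(train), rescale(test), scale


rng = random.Random(4)
data = [[rng.uniform(-5.0, 5.0) for _ in range(3)] for _ in range(50)]
train, test, scale = split_and_normalise(data, rng)
print(len(train), len(test), round(scale, 3))
```

Fixing the seed of `rng` per split is one simple way to obtain random but fixed splits that can be reused across kernels and methods.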

We first take a look at the performance benefits of H0/1 on these datasets in Figure 2. As before we simply plot $D$ on the X-axis even for H0/1 experiments for sake of clarity. We observe that in all four cases, H0/1 offers much higher accuracies as compared to RF when used with a small number of random features (see Figure 2a). Also note that the number of extra features added for H0/1 is not large (avg. $d = 45$ for the 6 datasets considered). As we increase the number of random features, H0/1 accuracies move up slowly. However the test feature construction overhead becomes large after a point and affects test times (see Figure 2c). The effect on training times (see Figure 2b) is not so clear since the use of H0/1 also seems to offer greater separability which mitigates the training feature construction overhead in some cases.

We provide details of the results in Table 1. We see that both RF and H0/1 offer significant speedups in both training and test times while offering competitive classification accuracies, with H0/1 doing so at much lower values of $D$. In some cases the reduction in classification accuracy for H0/1 is moderate but is almost always accompanied with a spectacular increase in training and test speeds.